Advances in Artificial Intelligence for Data Visualization: Automated Reading of Residual Plots with Computer Vision

Pre-submission Review

Weihao (Patrick) Li

Monash University

✍️Supervisors

Professor Dianne Cook, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia

Dr. Emi Tanaka, Biological Data Science Institute, Australian National University, Canberra, Australia

Assistant Professor Susan VanderPlas, Department of Statistics, University of Nebraska–Lincoln, USA

Senior Lecturer Klaus Ackermann, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia

📚Thesis Structure

  1. Exploring the application of visual inference in regression diagnostics and comparing it with conventional hypothesis tests.

  2. Designing an automated visual inference system to assess residual plots of the classical normal linear regression model.

  3. Deploying the automatic visual inference system as an online application and publishing the relevant open-source software.

This presentation focuses on the second project.

🔍Regression Diagnostics

Diagnostics are the key to determining whether there is anything importantly wrong with a regression model.


\[\underbrace{\boldsymbol{e}}_\textrm{Residuals} = \underbrace{\boldsymbol{y}}_\textrm{Observations} - \underbrace{f(\boldsymbol{x})}_\textrm{Fitted values}\]

Graphical approaches (plots) are the recommended methods for diagnosing residuals.

🤔Challenges

  • The vertical spread of the points appears to vary with the fitted values, suggesting heteroskedasticity.

  • However, this is an over-interpretation.

  • The visual pattern is in fact caused by a skewed distribution of the predictor.

🔬Visual Inference

The reading of residual plots can be calibrated by an inferential framework called visual inference (Buja et al. 2009).

Typically, a lineup of residual plots consists of

  • one data plot
  • \(19\) null plots containing residuals simulated from the fitted model.

To perform a visual test

  • Observer(s) will be asked to select the most different plot(s).
  • The p-value can be calculated using the beta-binomial model (VanderPlas et al., 2021).
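
As a rough illustration, suppose each of \(K\) independent observers picks the data plot with probability \(1/20\) under the null — a plain binomial approximation (an assumption for demonstration; VanderPlas et al. (2021) use a beta-binomial model to account for dependence between selections):

```python
from scipy.stats import binom

def lineup_pvalue(detections, observers, n_plots=20):
    """P(at least `detections` observers pick the data plot by chance),
    assuming independent uniform guessing over the lineup."""
    return binom.sf(detections - 1, observers, 1 / n_plots)

# 3 of 5 observers identified the data plot: strong evidence against the null
p = lineup_pvalue(detections=3, observers=5)
```

Even a handful of detections yields a very small p-value, because random guessing rarely singles out the data plot.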

🚫Limitations of Lineup Protocol

  1. Humans cannot
  • evaluate a lineup consisting of a large number of plots.
  • evaluate a large number of lineups.
  2. Evaluating lineups is costly in both labour and time.

🤖Computer Vision Model

Modern computer vision models are well-suited for addressing this challenge.

Such models are usually built on a deep neural network architecture called a convolutional neural network (CNN).

Source: https://en.wikipedia.org/wiki/Convolutional_neural_network

📏Measure the Difference

To develop a computer vision model for assessing lineups of residual plots, we need to define a numerical measure of “difference” or “distance” between plots.

  • pixel-wise sum of squared differences
  • Structural Similarity Index Measure (SSIM)
  • scagnostics
  • metric learning

🎲Residual Distribution

Consider the classical normal linear regression model

\[\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]

By the Frisch–Waugh–Lovell theorem,

\[\boldsymbol{e} = \boldsymbol{R}\boldsymbol{\varepsilon},\] where \(\boldsymbol{R}=\boldsymbol{I}_n -\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\).

We treat \(\boldsymbol{e}\) as a vector of random variables here.
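
The identity \(\boldsymbol{e} = \boldsymbol{R}\boldsymbol{\varepsilon}\) can be verified numerically (a small sketch with a hypothetical single-predictor design):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
R = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T       # residual operator

beta = np.array([1.0, 2.0])
eps = rng.normal(size=n)
y = X @ beta + eps

# OLS residuals equal R @ eps, because R annihilates the column space of X
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(e, R @ eps)
assert np.isclose(e.sum(), 0.0)          # intercept forces residuals to sum to zero
assert np.linalg.matrix_rank(R) == n - 2 # degenerate: rank n - rank(X)
```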

🎲Residual Distribution

When the model includes an intercept, the minimization of the sum of squared residuals implies
\[\sum_{i=1}^{n} e_i = 0 \quad\text{and}\quad \text{rank}(\boldsymbol{R}) = n - \text{rank}(\boldsymbol{X}) < n.\]

Thus, \(\boldsymbol{e}\) follows a degenerate multivariate distribution.

For simplicity, we will replace \(\text{cov}(\boldsymbol{e}, \boldsymbol{e}) = \boldsymbol{R}\sigma^2\boldsymbol{R}' = \boldsymbol{R}\sigma^2\) with a full-rank diagonal matrix \(\text{diag}(\boldsymbol{R}\sigma^2)\).

Then for a correctly specified model, \[\boldsymbol{e} \sim N(\boldsymbol{0}_n, \text{diag}(\boldsymbol{R}\sigma^2)).\]

The symbol \(Q\) will be used to denote this reference residual distribution.

🎲Residual Distribution

However, if the model is misspecified, the actual residual distribution, denoted \(P\), will differ from \(Q\).

  • For example, if \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n,\boldsymbol{V})\), where \(\boldsymbol{V} \neq \sigma^2\boldsymbol{I}_n\), \[\boldsymbol{e} \sim N(\boldsymbol{0}_n, \text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R})) \implies \text{Heteroskedasticity}.\]

  • And if some necessary higher-order predictors \(\boldsymbol{Z}\) are also omitted, \[\boldsymbol{e} \sim N(\boldsymbol{R}\boldsymbol{Z}\boldsymbol{\beta}_z, \text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R})) \implies \text{Non-linearity}~ \& ~\text{Heteroskedasticity}.\]

📏KL Divergence of \(P\) from \(Q\)

We defined a distance measure based on Kullback-Leibler divergence to quantify the extent of model violations

\[\begin{align} \label{eq:kl-0} D &= \log\left(1 + D_{KL}\right), \\ \label{eq:kl-1} D_{KL} &= \int_{\mathbb{R}^{n}}\log\frac{p(\boldsymbol{e})}{q(\boldsymbol{e})}p(\boldsymbol{e})d\boldsymbol{e}, \end{align}\]

where \(p(.)\) and \(q(.)\) are the probability density functions for distribution \(P\) and \(Q\) respectively.

  • \(D = 0\) if and only if \(P \equiv Q\).

📝Evaluation of \(D\) for Normal \(\boldsymbol{\varepsilon}\)

For a classical normal linear regression model that omits necessary higher-order predictors \(\boldsymbol{Z}\), and incorrectly assumes \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n,\sigma^2\boldsymbol{I}_n)\) while in fact \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \boldsymbol{V})\) with \(\boldsymbol{V} \neq \sigma^2\boldsymbol{I}_n\), the KL divergence can be further expanded to

\[\begin{align} \label{eq:kl-2} D_{KL} &= \frac{1}{2}\left(\log\frac{|\text{diag}(\boldsymbol{R}\sigma^2)|}{|\boldsymbol{W}|} - n + \text{tr}\left(\text{diag}(\boldsymbol{R}\sigma^2)^{-1}\boldsymbol{W}\right) + \boldsymbol{\mu}_z'\,\text{diag}(\boldsymbol{R}\sigma^2)^{-1}\boldsymbol{\mu}_z\right), \end{align}\]

where \(\boldsymbol{\mu}_z = \boldsymbol{R}\boldsymbol{Z}\boldsymbol{\beta}_z\), and \(\boldsymbol{W} = \text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R})\).

  • The assumed error variance \(\sigma^2\) is set to be \(\text{tr}(\boldsymbol{V})/n\), which is the expectation of the estimated variance.
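
The closed form can be sketched numerically using the standard formula for the KL divergence between two multivariate normal distributions (the design matrix and the error covariance \(\boldsymbol{V}\) below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # hypothetical design
R = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

def distance(V, mu_z):
    """D = log(1 + KL(P || Q)), with Q = N(0, diag(R sigma^2)) and
    P = N(mu_z, diag(R V R')), both diagonal, so determinants and
    traces reduce to sums over the diagonal entries."""
    sigma2 = np.trace(V) / n          # assumed error variance tr(V)/n
    sq = np.diag(R) * sigma2          # diagonal of diag(R sigma^2)
    w = np.diag(R @ V @ R)            # diagonal of W = diag(R V R')
    kl = 0.5 * (np.sum(np.log(sq)) - np.sum(np.log(w)) - n
                + np.sum(w / sq) + np.sum(mu_z ** 2 / sq))
    return np.log(1 + kl)

# Correctly specified model: P == Q, so the distance is zero
assert abs(distance(2 * np.eye(n), np.zeros(n))) < 1e-8

# Heteroskedastic errors give a strictly positive distance
V_het = np.diag(1 + 3 * X[:, 1] ** 2)
assert distance(V_het, np.zeros(n)) > 0
```

The null check works because with \(\boldsymbol{V} = \sigma^2\boldsymbol{I}_n\) and \(\boldsymbol{\mu}_z = \boldsymbol{0}_n\), \(\boldsymbol{W}\) coincides with \(\text{diag}(\boldsymbol{R}\sigma^2)\) and every term cancels.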

📝Evaluation of \(D\) for Non-normal \(\boldsymbol{\varepsilon}\)

For non-normal \(\boldsymbol{\varepsilon}\), \(P\) is unlikely to be a well-known distribution.

  • It is difficult to compute \(p(\boldsymbol{e})\).
  • It is difficult to integrate numerically over the \(n\)-dimensional space, because \(n\) can be very large.

Therefore, the elements of \(\boldsymbol{e}\) are assumed to be independent of each other.

📝Evaluation of \(D\) for Non-normal \(\boldsymbol{\varepsilon}\)

Under the independence assumption, the integration becomes a summation

\[\begin{align} \label{eq:kl-3} D_{KL} &\approx \sum_{i = 1}^{n} \widehat{D}_{KL}^{(i)}, \\ \widehat{D}_{KL}^{(i)} &= \frac{1}{m}\sum_{j = 1}^{m} \log\frac{\hat{p}_i(B_{ij})}{q(B_{ij})}, \end{align}\]

where \(B\) is an \(n \times m\) matrix whose columns are sets of simulated residuals, \(m\) is the number of simulations, and \(\hat{p}_i(.)\) is the kernel density estimator of \(p_i(.)\) constructed from the simulated residuals in row \(B_i\).

  • The assumed variance for \(q(.)\) is set to be \(\sum_{b \in vec(B)}(b - \sum_{b \in vec(B)} b/nm)^2/(nm - 1)\).
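
For intuition, the per-coordinate Monte Carlo estimate can be sketched for a single coordinate with scipy's `gaussian_kde` (a toy illustration; the centred exponential errors below are an assumption for demonstration):

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(2)
m = 5000  # number of simulations

def kl_hat(sims):
    """(1/m) * sum_j log( p_hat(sims_j) / q(sims_j) ), where p_hat is a
    KDE fitted to the simulated residuals and q is a normal density with
    the sample variance of the simulations."""
    p_hat = gaussian_kde(sims)
    q = norm(loc=0, scale=sims.std(ddof=1))
    return np.mean(np.log(p_hat(sims) / q.pdf(sims)))

normal_sims = rng.normal(size=m)
skewed_sims = rng.exponential(size=m) - 1  # centred but skewed residuals

# The skewed coordinate is farther from the reference normal distribution
assert kl_hat(skewed_sims) > kl_hat(normal_sims)
```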

🎯Approximation of the Distance

Let’s get back to the original problem. We have a fitted model on hand, and we want to know how different the residuals are from a set of good residuals.

  • However, the data generating process is typically unknown!

    • The distance cannot be computed!

We can train a computer vision model to approximate \(D\) from a residual plot

\[\begin{equation} \label{eq:d-approx} \widehat{D} = f_{CV}(V_{h \times w}(\boldsymbol{e}, \boldsymbol{\hat{y}})), \end{equation}\]

where \(V_{h \times w}(.)\) is a plotting function that saves a residual plot as an image with \(h \times w\) pixels, and \(f_{CV}(.)\) is a computer vision model that predicts a distance in \([0, +\infty)\).
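
The plotting function \(V_{h \times w}(.)\) can be sketched with matplotlib (a minimal stand-in, not the exact plotting pipeline used in the thesis):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

def V(e, fitted, h=32, w=32):
    """Render a residual plot as an h x w RGBA pixel array."""
    fig = plt.figure(figsize=(w / 32, h / 32), dpi=32)
    ax = fig.add_axes([0, 0, 1, 1])  # no margins, axes or labels
    ax.axis("off")
    ax.scatter(fitted, e, s=1, c="black")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba()).copy()
    plt.close(fig)
    return img

rng = np.random.default_rng(3)
img = V(rng.normal(size=100), rng.uniform(size=100))
assert img.shape == (32, 32, 4)
```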

🔄Generation of Training Data

Data generating process

\[\begin{align} \label{eq:data-sim} \boldsymbol{y} &= \boldsymbol{1}_n + \boldsymbol{x}_1 + \beta_1\boldsymbol{x}_2 + \beta_2(\boldsymbol{z} + \beta_1\boldsymbol{w}) + \boldsymbol{k} \odot \boldsymbol{\varepsilon}, \\ \boldsymbol{z} &\propto \text{He}_j(\boldsymbol{x}_1), \\ \boldsymbol{w} &\propto \text{He}_j(\boldsymbol{x}_2), \\ \boldsymbol{k} &= \sqrt{\boldsymbol{1}_n + b(2 - |a|)(\boldsymbol{x}_1 + \beta_1\boldsymbol{x}_2 - a\boldsymbol{1}_n)^2}, \end{align}\]

where \(\text{He}_j(.)\) is the \(j\)th-order probabilist’s Hermite polynomials, and the \(\sqrt{(.)}\) and \((.)^2\) operators are element-wise operators.

  • The residuals are obtained by regressing \(y\) on \(x_1\). If \(\beta_1 \neq 0\), \(x_2\) will also be included as a predictor.
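
The data generating process can be sketched in numpy, using `numpy.polynomial.hermite_e` for the probabilist's Hermite polynomials (the parameter values below are arbitrary illustrations, not the thesis's factor levels):

```python
import numpy as np
from numpy.polynomial import hermite_e

rng = np.random.default_rng(4)
n, j = 300, 2                       # sample size and Hermite order
beta1, beta2, a, b = 1.0, 1.0, 0.0, 1.0

x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
He_j = lambda x: hermite_e.hermeval(x, [0] * j + [1])  # probabilist's He_j
z, w = He_j(x1), He_j(x2)

# Heteroskedasticity factor k (element-wise square root)
k = np.sqrt(1 + b * (2 - abs(a)) * (x1 + beta1 * x2 - a) ** 2)
eps = rng.normal(size=n)
y = 1 + x1 + beta1 * x2 + beta2 * (z + beta1 * w) + k * eps

# Fit the misspecified model y ~ x1 + x2 (z and w are omitted)
X = np.column_stack([np.ones(n), x1, x2])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
assert abs(e.sum()) < 1e-8          # residuals sum to zero with an intercept
```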

💡Non-linearity

Factor \(j\)

Factor \(\sigma_\varepsilon\)

💡Heteroskedasticity

Factor \(a\)

Factor \(b\)

💡Non-normality

Distribution of \(\varepsilon\)

💡Multiple Model Violations

Non-linearity + Heteroskedasticity

Non-normality + Heteroskedasticity

💡Predictor Distribution

Distribution of predictor

💡Second Predictor

Non-linearity + Second predictor

🏛️Model Architecture

The architecture of the computer vision model is adapted from VGG16 (Simonyan and Zisserman 2014).
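
A minimal Keras sketch of a VGG-style regression network is given below (illustrative only; the layer sizes are assumptions, not the thesis's tuned architecture). The single ReLU output keeps predictions in \([0, +\infty)\):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# VGG-style blocks: stacked 3x3 convolutions followed by 2x2 max pooling
model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 1)),       # grayscale residual plot
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="relu"),    # distance prediction in [0, +inf)
])
model.compile(optimizer="adam", loss="mse")

pred = model(np.zeros((1, 32, 32, 1), dtype="float32"))
```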

🏋️‍♂️Model Training

The computer vision model is trained on the M3 high-performance computing platform (www.massive.org.au), using TensorFlow (Abadi et al. 2016) and Keras (Chollet et al. 2015).

  • The training, validation and test sets contain 64,000, 16,000 and 8,000 images respectively.

  • The distribution of the target variable \(D\) is controlled such that it roughly follows a uniform distribution.

  • Multiple models with different input resolutions were trained; the best-performing model has an input size of \(32 \times 32\).

🌐Overview of Model Performance

The model performs

  • better for non-normality
  • worse for non-linearity
  • consistently across different datasets

The model performs worst on null plots (D = 0) as expected.

📈Model Violations Index

For a consistent data generating process, \(D\) typically increases logarithmically with the number of observations.

This behaviour comes from the relationship \(D = \log(1 + D_{KL})\), where \(D_{KL} = \sum_{i=1}^{n}D_{KL}^{(i)}\) under the independence assumption.

Therefore, the Model Violations Index (MVI) for determining the extent of model violations can be proposed as

\[\begin{equation} \label{eq:mvi} \text{MVI} = C + \widehat{D} - \log(n), \end{equation}\]

where \(C\) is a sufficiently large constant that keeps the result positive.
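
To see why subtracting \(\log(n)\) stabilizes the index, suppose each observation contributes roughly a constant \(d\) to \(D_{KL}\), so that \(D = \log(1 + nd)\) (a stylized assumption for illustration):

```python
import numpy as np

C, d = 10.0, 0.05  # hypothetical constant and per-observation contribution
mvis = [C + np.log(1 + n * d) - np.log(n) for n in (100, 1000, 10000)]
print([round(v, 3) for v in mvis])  # -> [7.187, 7.024, 7.006]
```

The index is roughly constant across sample sizes, converging to \(C + \log(d)\) as \(n\) grows.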

🔬Statistical Testing Based on the Approximated Distance

We have proposed a statistical test based on the approximated distance

  • The null distribution can be estimated by predicting approximated distance for a large number of null plots.

  • The critical value can be estimated by the sample quantile (e.g. \(Q_{null}(0.95)\)) of the null distribution.

  • The \(p\)-value is the proportion of null plots having approximated distance greater than or equal to the observed one.
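
The testing procedure can be sketched as follows, using a made-up sample of predicted distances in place of the trained model's output on null plots:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for f_CV predictions on many null plots (hypothetical values)
null_dists = rng.gamma(shape=2.0, scale=0.5, size=10000)
observed = 4.2  # hypothetical approximated distance for the data plot

critical_value = np.quantile(null_dists, 0.95)  # reject if observed exceeds this
p_value = np.mean(null_dists >= observed)       # proportion of nulls >= observed
```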

🧪Human Subject Experiment

We have collected \(7974\) responses to \(1152\) lineups in a human subject experiment conducted in the first project (Li et al. 2023).

Differences between that experimental design and the current study:

  • non-linearity and heteroskedasticity do not co-exist in one lineup.
  • non-normality and multiple predictors are not considered in the experimental design.

🧪Compared to Visual Tests

🧪Compared to RESET and BP Tests

🧪Power Comparison

💡Example: Left-triangle

💡Example: Boston Housing

\(\text{MEDV} = \beta_0 + \beta_1\text{RM} + \beta_2\text{LSTAT} + \beta_3\text{PTRATIO} + \varepsilon\)

💡Example: Dinosaur

\(y = \beta_0 + \beta_1x + \varepsilon\)

🎬Conclusions

In this study, we have

  • introduced a distance measure that effectively captures the magnitude of model violations

  • proposed and trained a computer vision model to approximate this distance

  • developed a statistical testing procedure and a Model Violations Index based on the approximated distance

🎬Conclusions

  • The proposed test shows comparable power to conventional ones, but there’s room for further enhancing its robustness against minor deviations from model assumptions.

  • Our method offers significant value by easing analysts’ workload in assessing residual plots.

  • While we encourage continued manual review of residual plots for insightful analysis, our approach serves as a valuable tool for automating or supplementing the diagnostic process.

📅Timeline

| Date | Event |
|------|-------|
| Mar | Finish and submit the second paper |
| April | Develop a web interface for the automatic visual inference system |
| May | Clean and publish R packages to CRAN |
| July | Finish the third paper; attend the useR! conference |
| Aug | Submit thesis |

Thanks! Any questions?